The purpose of this workflow is to compare different numbers of states specified for ChromHMM. We do this in several ways. First, we look for the maximum correlation between chromatin state and gene expression. Second, we look for state overlaps with genomic features.

This workflow can be run after 1.3_Chromatin_States_and_Expression.Rmd has been run for multiple state numbers. It uses the “Prop_Expression_Cor” files to find the maximum and minimum associations between proportion of each state and the expression of each transcript. It plots these maxima and minima across the states analyzed.

Source code

library("here")
library("pheatmap")
all.code.dir <- list.files(here("Code"), full.names = TRUE)
for(i in 1:length(all.code.dir)){
    all.fun <- list.files(all.code.dir[i], full.names = TRUE, pattern = ".R")
    for(j in 1:length(all.fun)){source(all.fun[j])}
}
is.interactive = FALSE
#is.interactive = TRUE

Collect results

Collect all results for each state. The results are stored in the Prop_Expression_Cor files in each analysis directory. These files hold the r values describing the correlation between the proportion of each state and expression level in inbred mice.

state.num <- list.files(here("Results", "ChromHMM"), full.names = TRUE)
state.names <- basename(state.num)
state.as.num <- as.numeric(sapply(strsplit(state.names, "_"), function(x) x[1]))
state_order <- order(state.as.num)
all.prop <- vector(mode = "list", length = length(state.num))
names(all.prop) <- basename(state.num)[state_order]
for(i in 1:length(state.num)){
    prop.file <- list.files(path = state.num[state_order[i]], pattern = "Prop_Expression_Cor", full.names = TRUE)
    if(length(prop.file) > 0){all.prop[[i]] <- readRDS(prop.file)}
}

Correlations with State Proportion

Calculate the median correlation between the proportion of each state and gene expression. Plot all state-expression correlations for each model.

The figure below shows how the states correlate with expresion in each different model. The 4-state model does quite well. The next two models add states without any correlation with expression. Starting with the 7-state model there is an additional state that correlates moderately highly with expression. There is also a jump up in the maximum correlation between a state and expression starting with this model. The 7-, 8-, and 9-state models are all similar in this regard, with the 9-state model having the highest maximum correlation between a state and expression. The 9-state model also adds a state with moderately negative correlation with expression. After this state, all states added are redundant in terms of correlation with expression, and add more parameters to our model. They potentially oversplit the states, especially with regard to modeling expression. Furthermore, there is a step down in the maximum correlation with expression starting with the 10-state model.

For these reasons, we selected the 9-state model as the best model for explaining gene expression.

all.prop.r <- lapply(all.prop, function(x) x$all.r)
all.prop.p <- lapply(all.prop, function(x) x$all.p)
med.prop.r <- lapply(all.prop.r, 
function(x) if(length(x) > 0){apply(x, 2, function(y) median(y, na.rm = TRUE))})

The plot below shows the correlation distributions for each model across all states. Each point shows the median correlation between a state and inbred expression across all genes. The maximum and minimum are fairly consistent across all models. The nine-state model has the highest maximum. As we get more states the number of states with poor correlation increases, but so does the number of states with higher correlations. They all so similar that it’s difficult to pick one as the best.

prop.col <- lapply(med.prop.r, function(x) colors.from.values(x, use.pheatmap.colors = TRUE))
## Loading required package: grid
plot.new()
plot.window(xlim = c(1, length(med.prop.r)), ylim = c(-0.25, 0.3))
for(i in 1:length(med.prop.r)){
    points(x = jitter(rep(i, length(med.prop.r[[i]]))), y = med.prop.r[[i]], pch = 16,
    col = prop.col[[i]])
}
axis(2)
par(xpd = TRUE)
text(y = rep(-0.3, length(med.prop.r)), x = 1:length(med.prop.r), 
labels = names(med.prop.r), srt = 90)
par(xpd = FALSE)
abline(h = 0, col = "gray")

Emissions and State Correlations

Which states across the different models have high and low correlations with expression?

emissions.dir <- list.files(here("Data", "ChromHMM"), full.names = TRUE)
emissions.files <- lapply(emissions.dir, 
function(x) get.files(x, want = c("emissions", ".txt"), full.names = TRUE))[state_order]
emission.mats <- lapply(emissions.files, function(x) read.table(x[1], 
row.names = 1, header = TRUE, sep = "\t"))
all.expr.cor <- NULL

for(i in 1:length(emission.mats)){
    cat("###", names(all.prop.r)[i], "\n")
    exprV <- med.prop.r[[i]]
    names(exprV) <- paste0(paste0(i+3, "-state-model_state"), names(exprV))
    expr.df <- data.frame(med.prop.r[[i]])
    all.expr.cor <- c(all.expr.cor, exprV)
    colnames(expr.df) <- "Expression_Correlation"
    expr.order <- order(expr.df)
    mark.order <- order(colnames(emission.mats[[i]]))
    pheatmap(emission.mats[[i]][expr.order,mark.order], annotation_row = expr.df, 
    cluster_rows = FALSE, cluster_cols = FALSE)
    cat("\n\n")
}

4_states_C

5_states_C

6_states_C

7_states_C

8_states_C

9_states_C

10_states_C

11_states_C

12_states_C

13_states_C

14_states_C

15_states_C

16_states_C

Emissions and State Correlations For All Models

for(i in 1:length(emission.mats)){
    rownames(emission.mats[[i]]) <- paste0(paste0(i+3, "_states"), rownames(emission.mats[[i]]))
    emission.mats[[i]] <- apply(emission.mats[[i]], 2, function(x) bin.vector(x, c(0, 0.5, 1)))
}

all.states <- Reduce("rbind", emission.mats)
full.mat <- cbind(all.states, all.expr.cor)
expr.order <- order(all.expr.cor)
mark.order <- order(colnames(all.states))

pheatmap(full.mat[expr.order,mark.order], cluster_rows = FALSE, cluster_cols = FALSE)

Correlation Distributions

The following plots show the distributions of correlations between states and gene expression for the states tested.

boxplot.r <- function(r.mats, p.mats, state.num){
    form.states <- paste0(state.num, "_states_C")
    state.locale <- which(names(r.mats) == form.states)
    
    r.mat <- r.mats[[state.locale]]
    p.mat <- p.mats[[state.locale]]

    plot.new()
    plot.window(xlim = c(0,(ncol(r.mat)+1)), ylim = c(-1, 1))
    for(i in 1:ncol(r.mat)){
        p.col <- colors.from.values(-log10(p.mat[,i]), use.pheatmap.colors = TRUE)
        not.na <- which(!is.na(r.mat[,i]))
        points(x = jitter(rep(i, nrow(r.mat))), y = r.mat[,i], col = p.col, 
        pch = 16, cex = 0.5)
        boxplot(r.mat[not.na,i], at = i, col = "lightgray", add = TRUE)
        }
    axis(2);mtext(side = 2, "Correlation Coefficient", line = 2.5)
    abline(h = 0)
    par(xpd = TRUE)
    text(x = 1:ncol(r.mat), y = -1.1, labels = paste0("State", 1:ncol(r.mat)), 
    srt = 90, adj = 1)
    mtext(side = 3, text = paste0(state.num, "-state model"), line = 1)
    par(xpd = FALSE)
}

show.emissions <- function(state.num){
    form.states <- paste0(state.num, "_states_C")
    emissions.file <- here("Data", "ChromHMM", form.states, paste0("emissions_", state.num, ".png"))
    cat("![](",emissions.file,")\n")
}
for(i in 1:length(state.as.num)){
    cat("###", state.names[state_order[i]], "\n")
    if(is.interactive){quartz()}
    boxplot.r(all.prop.r, all.prop.p,  state.as.num[state_order[i]])
    show.emissions(state.as.num[state_order[i]])
    cat("\n\n")
}

4_states_C

5_states_C

6_states_C

7_states_C

8_states_C

9_states_C

10_states_C

11_states_C

12_states_C

13_states_C

14_states_C

15_states_C

16_states_C

Genomic Overlaps

We next look at each set of states and how the state positions in the B6 animals overlap with B6 genomic coordinates. We used the genomic coordinate files included in ChromHMM, which includes coordinates for exons, TSS, TES, genes, and CpG islands. We downloaded additional files from the UCSC genome browser. To do that we went to [http://genome.ucsc.edu/cgi-bin/hgTables] and downloaded a few bed files by hand for different tracks.

We can use these enrichments to annotate the states we identified in ChromHMM.

enrichment.files <- list.files(here("Results", "Enrichments"), 
pattern = ".txt", full.names = TRUE)[state_order]

for(i in 1:length(enrichment.files)){
    cat("###", state.names[state_order[i]], "\n")
    if(is.interactive){quartz(width = 8, height = 4)}
    enrichment.map <- as.matrix(read.delim(enrichment.files[i], row.names = 1))
    pheatmap(enrichment.map, scale = "column", cluster_rows = FALSE, cluster_cols = FALSE)
    cat("\n\n")
}

4_states_C

5_states_C

6_states_C

7_states_C

8_states_C

9_states_C

10_states_C

11_states_C

12_states_C

13_states_C

14_states_C

15_states_C

16_states_C

pdf(here("Documents", "3.Manuscript", "3.Manuscript", "enrichment.pdf"), width = 5, height = 2.5)
pheatmap(t(enrichment.map[,c(1, 2,3,4,7,8,10,11,13)]), scale = "row", cluster_cols = FALSE)
dev.off()

We have selected the 9-state model for this analysis, so I will limit my discussion to that state here.

State 7, which has the highest positive correlation with expression localizes primarily to promoters, and transcription start sites (TSS), as well as to exons.

State 5, which is also positively correlated with expression localizes to enhancers.

State 3, which is negatively correlated with expression has similar localization to state 7, but is primarily enriched at transcription factor binding sites. It is also enriched at promoters.

State 1 is the absence of all measured marks, and is negatively correlated with expression. It is primarily associated with intergenic regions.

State 2 is also negatively correlated with expression. It is the presence of just H3K27me3. It does not strongly localize to any of the features represented here, but may be slightly more present in introns or up/downstream regulatory regions.

Conclusions

We used the data presented here to select the 9-state model for further analysis.